Expressive Power of Tree and String Based Wrappers
Abstract
There are two types of wrappers: string-based wrappers, such as the LR wrapper, and tree-based wrappers. A tree-based wrapper designates extraction regions by nodes on the trees of semistructured documents. The tree-based wrapper seems to be more powerful than the string-based one. However, there are many HTML documents on the Web from which a standard tree-based wrapper fails to extract contents because they are structured by presentational tags, punctuation symbols, and white space. Moreover, some of these documents use multibyte characters for structuring. To handle some of these documents, we propose automatic wrapper generation that is based on common substring detection and uses input documents without any modification. In this framework, parts of text elements, including white space and multibyte characters, can be part of a wrapper. We show the superiority of such wrappers over usual wrappers, which are created after documents are parsed and modified. However, there still exist HTML documents from which wrappers with text elements fail to extract contents. Thus, we propose another class of wrappers, called the regional tree wrapper, which utilizes the tree structures of input documents as well as addressing functions on strings.
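As a rough illustration of the string-based side of this comparison, the sketch below learns a left/right delimiter pair (in the spirit of an LR wrapper) by detecting the substrings common to the contexts around labeled targets in unmodified documents. It is a simplified assumption-laden example, not the generation algorithm of the paper: the function names `common_suffix`, `learn_lr_wrapper`, and `extract`, the single-field setting, and the sample documents are all hypothetical.

```python
import os
import re


def common_suffix(strings):
    """Longest common suffix, computed as a reversed common prefix."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def learn_lr_wrapper(examples):
    """Learn a (left, right) delimiter pair from (document, target) pairs.

    Hypothetical simplification: one target per document, and the
    delimiters are the substrings shared by all left and right contexts.
    """
    lefts, rights = [], []
    for doc, target in examples:
        start = doc.index(target)  # first occurrence of the labeled target
        lefts.append(doc[:start])
        rights.append(doc[start + len(target):])
    left = common_suffix(lefts)
    right = os.path.commonprefix(rights)
    if not left or not right:
        raise ValueError("no common delimiters found")
    return left, right


def extract(doc, wrapper):
    """Return every shortest string enclosed by the learned delimiters."""
    left, right = wrapper
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    return re.findall(pattern, doc, flags=re.DOTALL)


# Raw documents structured by white space and a full-width (multibyte) slash.
examples = [
    ("<p>Name: Alice ／ Tokyo</p>", "Alice"),
    ("<p>Name: Bob ／ Osaka</p>", "Bob"),
]
wrapper = learn_lr_wrapper(examples)
print(wrapper)
print(extract("<p>Name: Carol ／ Kyoto</p>", wrapper))
```

Because the documents are used without parsing or normalization, the white space and the full-width slash survive into the learned right delimiter, which is the kind of text element the abstract says can become part of a wrapper.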
Similar resources
Minimalist Syntax, Multiple Regular Tree Grammars and Direction Preserving Tree Transductions
Model-theoretic syntax deals with the logical characterization of complexity classes. The first results in this area were obtained in the early and late Sixties of the last century. In these results it was established that languages recognised by finite string and tree automata are definable by means of monadic second-order logic (MSO). To be slightly more precise, the classical results, just m...
Forest-to-String Statistical Translation Rules
In this paper, we propose forest-to-string rules to enhance the expressive power of tree-to-string translation models. A forest-to-string rule is capable of capturing nonsyntactic phrase pairs by describing the correspondence between multiple parse trees and one string. To integrate these rules into tree-to-string translation models, auxiliary rules are introduced to provide a generalization lev...
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
Learning (k, l)-Contextual Tree Languages for Information Extraction
Learning regular languages from positive examples only is known to be infeasible. A common solution is to define a learnable subclass of the regular languages. In the past, this has been done for regular string languages. Using ideas from those techniques, we define a learnable subclass of regular unranked tree languages, called the (k,l)-contextual tree languages. We describe the use of this s...
Voltage Sag Compensation with DVR in Power Distribution System Based on Improved Cuckoo Search Tree-Fuzzy Rule Based Classifier Algorithm
A new technique is presented to improve the performance of the dynamic voltage restorer (DVR) for voltage sag mitigation. This control scheme is based on the cuckoo search algorithm with a tree fuzzy rule based classifier (CSA-TFRC). CSA is used to optimize the output of TFRC so that the classification output of the network is enhanced. While, the combination of cuckoo search algorithm, fuzzy and decision tree...